<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Gaussian Process Regression</title>
    <link type="text/css" rel="stylesheet" href="style.css">
  </head>

  <body>

<h2>Gaussian Process Regression</h2>

This page contains 4 sections
<ol>
<li> <a href="#desc">Description</a> of the GP regression function
<tt>gpr.m</tt></li> <li> <a href="#1-d">1-d demo</a> using gpr.m</li> <li> <a
href="#ard">Multi-dimensional demo</a> using gpr.m with Automatic Relevance
Determination (ARD)</li> <li> <a href="#ref">References</a></li>
</ol>


<h3 id="desc">1. Description of the GP regression function <tt>gpr.m</tt></h3>

The basic computations needed for standard Gaussian process regression (GPR)
are straight forward to implement in matlab. Several implementations are
possible, here we present an implementation closely resembling Algorithm 2.1
(p. 19) with three exceptions: Firstly, the predicitive variance returned is
the variance for noisy test-cases, whereas Algorithm 2.1 gives the variance for
the noise-free latent function; conversion between the two variances is done by
simply adding (or subtracting) the noise variance. Secondly, the
<em>negative</em> log marginal likelihood is returned, and thirdly the partial
derivatives of the negative log marginal likelihood, equation (5.9), section
5.4.1, page 114 are also computed.</p>

Algorithm 2.1, section 2.2, page 19:</p>
<center><img src="alg21.gif"></center><br>

A simple implementation of a Gaussian process for regression is provided by the
<a href="../gpml/gpr.m">gpr.m</a> program (which can conveniently be used
together with <a href="../gpml/minimize.m">minimize.m</a> for optimization of
the hyperparameters). The program can do one of two things:
<ol>
<li> compute the negative log marginal likelihood and its partial derivatives
wrt. the hyperparameters, usage</p>
<pre>[nlml dnlml] = gpr(logtheta, covfunc, x, y)
</pre>which is used when "training" the hyperparameters, or</li><br>
<li> compute the (marginal) predictive distribution of test inputs, usage</p>
<pre>[mu S2]  = gpr(logtheta, covfunc, x, y, xstar)
</pre></li>
</ol>
Selection between the two modes is indicated by the presence (or absence) of
test cases, <tt>xstar</tt>. The arguments to the <a
href="../gpml/gpr.m">gpr.m</a> function are:<br>
<table border=0 cols=2 width="100%">
<tr><td><b>inputs</b></td><td></td></tr>
<tr><td width="10%"><tt>logtheta</tt></td><td>a (column) vector containing the
logarithm of the hyperparameters</td></tr>
<tr><td><tt>covfunc</tt></td><td>the covariance function</td></tr>
<tr><td><tt>x</tt></td><td>a <tt>n</tt> by <tt>D</tt> matrix of training
inputs</td></tr>
<tr><td><tt>y</tt></td><td>a (column) vector if training set
targets (of length <tt>n</tt>)</td></tr>
<tr><td><tt>xstar</tt></td><td>a <tt>nn</tt> by <tt>D</tt> matrix of test
inputs</td></tr>
<tr><td><b>outputs</b></td><td></td></tr>
<tr><td><tt>nlml</tt></td><td>the negative log marginal likelihood</td></tr>
<tr><td><tt>dnlml</tt></td><td>(column) vector with 
the partial derivatives of the negative log marginal likelihood wrt. the
logarithm of the hyperparameters.</td></tr>
<tr><td><tt>mu</tt></td><td>(column) vector (of length <tt>nn</tt>) of
predictive means</td></tr>
<tr><td><tt>S2</tt></td><td>(column) vector (of length <tt>nn</tt>) of
predictive variances</td></tr>
</table><br>

The <tt>covfunc</tt> argument specifies the function to be called to evaluate
covariances. The <tt>covfunc</tt> can be specified either as a string naming
the function to be used or as a cell array. A number of covariance functions
are provided, see <a href="../gpml/covFunctions.m">covFunctions.m</a> for a
more complete description. A commonly used covariance function is the sum of
a sqaured exponential (SE) contribution and independent noise. This can be
specified as:

<pre>
  covfunc = {'covSum', {'covSEiso','covNoise'}};
</pre>

where <tt>covSum</tt> merely does some bookkeeping and calls the squared
exponential (SE) covariance function <a
href="../gpml/covSEiso.m">covSEiso.m</a> and the independent noise covariance
<a href="../gpml/covNoise.m">covNoise.m</a>. 

The squared exponential (SE) covariance function (also called the radial basis
function (RBF) or Gaussian covariance function) is given by in equation (2.31)
in section 2.3 page 19 for the scalar input case, and equation (5.1) section
5.1 page 106 for multivariate inputs.

<h3 id="1-d">2. 1-d demo using gpr.m</h3>

You can either follow the example below or run a short script <a
href="../gpml-demo/demo_gpr.m">demo_gpr.m</a>.</p>

First, add the directory containing the <tt>gpml</tt> code to your path
(assuming that your current directory is <tt>gpml-demo</tt>:

<pre>
  addpath('../gpml')
</pre>

Generate some data from a noisy Gaussian process (it is not essential to
understand the details of this):

<pre>
  n = 20;
  rand('state',18);
  randn('state',20);
  covfunc = {'covSum', {'covSEiso','covNoise'}};
  loghyper = [log(1.0); log(1.0); log(0.1)];
  x = 15*(rand(n,1)-0.5);
  y = chol(feval(covfunc{:}, loghyper, x))'*randn(n,1);      % Cholesky decomp.
</pre>

Take a look at the data:

<pre>
  plot(x, y, 'k+', 'MarkerSize', 17)
</pre>

<center><img src="figl.gif"></center><br>

Now, we compute the predictions for this data using a Gaussian process with
a covariance function being the sum of a squared exponential (SE) contribution
and an independent noise contribution:

<pre>
  covfunc = {'covSum', {'covSEiso','covNoise'}};
</pre>

To verify how many hyperparameters this covariance function expects:

<pre>
  feval(covfunc{:})
</pre>

which reurns <tt>'2+1'</tt>, ie two hyperparameters for the SE part (a
characteristic length-scale parameter and a signal magnitude parameter) and a
single parameter for the noise (noise magnitude). We specify the
hyperparameters <tt>ell=1.0</tt> (characteristic length-scale),
<tt>sigma2f=1.0</tt> (signal variance) and <tt>sigma2n=0.01</tt> (noise
variance)

<pre>
  loghyper = [log(1.0); log(sqrt(1.0)); log(sqrt(0.01))]; % this is for figure 2.5(a)
</pre>

We place <tt>201</tt> test points evenly distributed in the interval <tt>[-7.5,
7.5]</tt> and make predictions

<pre>
  xstar = linspace(-7.5,7.5,201)';
  [mu S2] = gpr(loghyper, covfunc, x, y, xstar);
</pre>

The predictive variance is computed for the distribution of test targets, which
are assumed to be as noisy as the training targets. If we are interested
instead in the distribution of the noise-free function, we subtract the noise
variance from <tt>S2</tt>

<pre>
  S2 = S2 - exp(2*loghyper(3));
</pre>

To plot the mean and two standard error (95% confidence), noise-free, pointwise
errorbars:

<pre>
  hold on
  errorbar(xstar, mu, 2*sqrt(S2), 'g');
</pre>

Alternatively, you can also display the error range in gray-scale:

<pre>
  clf
  f = [mu+2*sqrt(S2);flipdim(mu-2*sqrt(S2),1)];
  fill([xstar; flipdim(xstar,1)], f, [7 7 7]/8, 'EdgeColor', [7 7 7]/8);
  hold on
  plot(xstar,mu,'k-','LineWidth',2);
  plot(x, y, 'k+', 'MarkerSize', 17);
</pre>

which procudes<br>

<center><img src="figl1.gif"></center><br>

reproducing Figure 2.5(a), section 2.3, page 20. Note that the two standard
deviation error bars are for the noise-free function. If you rather want the
95% confidence region for the noisy observations you should not subtract the
noise variance <tt>sigma2n</tt> from the predictive variances <tt>S2</tt>. To
reproduce Figs 2.5 (b) and (c) change the log hyperparameters when calling the
prediction program to:

<pre>
  loghyper = [log(0.3); log(1.08); log(5e-5)];  % this is for figure 2.5(b)
  loghyper = [log(3.0); log(1.16); log(0.89)];  % this is for figure 2.5(c)
</pre>

respectively. To learn, or optimize the hyperparameters by maximizing the
marginal likelihood using the derivatives from equation (5.9) section 5.4.1
page 114 using <a href="../gpml/minimize.m">minimize.m</a> do:

<pre>
  loghyper = minimize([-1; -1; -1], 'gpr', -100, covfunc, x, y);
  exp(loghyper)
</pre>

Here, we initialize the log of the hyperparameters to be all at minus one. We
minimize the negative log marginal likelihood wrt. the log of the
hyperparameters (allowing <tt>100</tt> evaluations of the function (this is
more than adequate)), and report the hyperparameters:

<pre>
  1.3659
  1.5461
  0.1443
</pre>

Note that the hyperparameters learned here a close, but not identical to the
log hyperparameters <tt>1.0, 1.0, 0.1</tt> used when generating the data. The
discrepancy is partially due to the small training sample size, and partially
due to the fact that we only get information about the process in a very
limited range of input values. Predicting with the learned hyperparameters
produces:<br>

<center><img src="figlf.gif"></center><br>

which is not very different from the plot above (based on the hyperparameters
used to generate the data). Repeating the experiment with more training points
distributed over a wider range leads to more accurate estimates.</p>

Note, that above we have use the same functional form for the covariance
function, as was used to generate the data. In practise things are seldom so
simple, and one may have to try different covariance functions. Here, we try
to explore how a Matern form, with a shape parameter of 3/2 
(see eq. 4.17) would do.

<pre>
  covfunc2 = {'covSum',{'covMatern3iso','covNoise'}};
  loghyper2 = minimize([-1; -1; -1], 'gpr', -100, covfunc2, x, y);
</pre>

Comparing the value of the marginal likelihoods for the two models

<pre>
  -gpr(loghyper, covfunc, x, y)
  -gpr(loghyper2, covfunc2, x, y)
</pre>

with values of <tt>-15.6</tt> for SE and <tt>-18.0</tt> for Matern3, shows that
the SE covariance function is about <tt>exp(18.0-15.6)=11</tt> times more
probable than the Matern form for these data (in agreement with the data
generating process). The predictions from the worse Matern-based model

<pre>
  [mu S2] = gpr(loghyper2, covfunc2, x, y, xstar);
  S2 = S2 - exp(2*loghyper2(3));
</pre>  

look like this:<br>

<center><img src="figlm.gif"></center><br>

Notice how the uncertainty grows more rapidly in the vicinity of data-points,
reflecting the property that for the Matern class 
with a shape parameter of 3/2 the stochastic process is not twice mean-square
differentiable (and is thus
much less smooth than the SE covariance function).


<h3 id="ard">3. GPR demo using gpr.m with multi-dimensional input space and
ARD</h3>

This demonstration illustrates the use of a Gaussian Process
regression for a multi-dimensional input, and illustrates
the use of automatic relevance determination (ARD). It is based
on <a href="#williams-rasmussen-96">Williams and Rasmussen (1996)</a>.</p>

You can either follow the example below or run the script <a
href="../gpml-demo/demo_gparm.m">demo_gparm.m</a>.</p>

We initially consider a 2-d nonlinear robot arm mapping problem as defined in
<a href="#mackay-92">Mackay (1992)</a>

<pre>
  f(x_1,x_2) = r_1 cos (x_1) + r_2 cos(x_1 + x_2)
</pre>

where x_1 is chosen randomly in [-1.932, -0.453], x_2 is chosen randomly in
[0.534, 3.142], and r_1 = 2.0, r_2 = 1.3.  The target values are obtained by
adding Gaussian noise of variance 0.0025 to f(x_1,x_2).  Following <a
href="#neal-96">Neal (1996)</a> we add four further inputs, two of which (x_3
and x_4) are copies of x_1 and x_2 corrupted by additive Gaussian noise of
standard deviation 0.02, and two of which (x_5 and x_6) are N(0,1) Gaussian
noise variables.  Our dataset has n=200 training points and nstar=200 test
points.</p>

The training and test data is contained in the file <a
href="../gpml-demo/data_6darm.mat">data_6darm.mat</a>. The raw training data is
in the input matrix <tt>X</tt> (<tt>n=</tt>200 by <tt>D=</tt>6) and the target
vector <tt>y</tt> (200 by 1). Assuming that the current directory is
<tt>gpml-demo</tt> we need to add the path of the code, and load the data:

<pre>
  addpath('../gpml');    % add path for the code
  load data_6darm
</pre>

We first check the scaling of these variables using

<pre>
  mean(X), std(X), mean(y), std(y)
</pre>

In particular we might be concerned if the standard deviation is very different
for different input dimensions; however, that is not the case here so we do not
carry out rescaling for <tt>X</tt>.  However, <tt>y</tt> has a non-zero mean
which is not appropriate if we assume a zero-mean GP. We could add a constant
onto the SE covariance function corresponding to a prior on constant offsets,
but here instead we centre <tt>y</tt> by setting:

<pre>
  offset = mean(y);
  y = y - offset;         % centre targets around 0.
</pre>

We use Gaussian process regression with a squared exponential covariance
function, and allow a separate lengthscale for each input dimension, as in
eqs. 5.1 and 5.2.  These lengthscales (and the other hyperparameters sigma_f
and sigma_n) are adapted by maximizing the marginal likelihood (eq. 5.8)
w.r.t. the hyperparameters. The covariance function is specified by

<pre>
  covfunc = {'covSum', {'covSEard','covNoise'}};
</pre>

We now wish to train the GP by optimizing the hyperparameters.  The
hyperparameters are stored as <tt>logtheta = [log(ell_1), log(ell_2),
... log(ell_6), log(sigma_f), log(sigma_n)]</tt> (as <tt>D = 6</tt>), and are
initialized to <tt>logtheta0 = [0; 0; 0; 0; 0; 0; 0; log(sqrt(0.1))]</tt>.  The
last value means that the initial noise variance sigma^2_n is set to 0.1.  The
marginal likelihood is optimized using the <tt>minimize</tt> function.

<pre>
  logtheta0 = [0; 0; 0; 0; 0; 0; 0; log(sqrt(0.1))];
  [logtheta, fvals, iter] = minimize(logtheta0, 'gpr', -100, covfunc, X, y);
  exp(logtheta) 
</pre>

By doing <tt>exp(logtheta)</tt> we find that:

<pre>
  ell_1      1.804377
  ell_2      1.963956
  ell_3      8.884361
  ell_4     34.417657
  ell_5   1081.610451
  ell_6    375.445823
  sigma_f    2.379139
  sigma_n    0.050835
</pre>

Notice that the lengthscales ell_1 and ell_2 are short indicating that inputs
x_1 and x_2 are relevant to the task.  The noisy inputs x_3 and x_4 have longer
lengthscales, indicating they are less relevant, and the pure noise inputs x_5
and x_6 have very long lengthscales, so they are effectively irrelevant to the
problem, as indeed we would hope.  The process std deviation sigma_f is similar
in magnitude to the standard deviation of the data <tt>std(y) = 1.2186</tt>.
The learned noise standard deviation sigma_n is almost exactly equal to the
generative noise level sqrt(0.0025)=0.05.</p>

We now make predictions on the test points and assess the accuracy of these
predictions. The test data is in <tt>Xstar</tt> and <tt>ystar</tt>. Recall that
the training targets were centered, thus we must adjust the predictions by
<tt>offset</tt>:

<pre>
  [fstar S2] = gpr(logtheta, covfunc, X, y, Xstar);
  fstar = fstar + offset;  % add back offset to get true prediction
</pre>

We compute the residuals, the mean squared error (mse) and the predictive log
likelihood (pll). Note that the predictive variance for <tt>ystar</tt> includes
the noise variance, as explained on p. 18.

<pre>
  res = fstar-ystar;  % residuals
  mse = mean(res.^2)
  pll = -0.5*mean(log(2*pi*S2)+res.^2./S2)
</pre>

The mean squared error is 0.002489. Note that this is almost equal to the value
0.0025, as would be obtained from the perfect predictor, due to the added noise
with variance 0.0025.</p>
 
We can also plot the residuals and the predictive (noise-free) variance for
each test case. Note that the order along the x-axis is arbitrary.

<pre>
  subplot(211), plot(res,'.'), ylabel('residuals'), xlabel('test case')
  subplot(212), plot(sqrt(S2),'.'), ylabel('predictive std deviation'), xlabel('test case')
</pre>


<h3 id="ref">4. References</h3>

<ul>
<li id=mackay-92>D. J. C. MacKay, <a
href="http://www.inference.phy.cam.ac.uk/mackay/backprop.nc.ps.gz">A Practical
Bayesian Framework for Backpropagation Networks</a>, Neural Computation 4,
448-472, (1992)
</li>
<li id=neal-96>
R. M. Neal, <a href="http://www.cs.utoronto.ca/~radford/bnn.book.html">Bayesian
Learning for Neural Networks</a>, Springer, (1996)
</li>
<li id=williams-rasmussen-96>
C. K. I. Williams and C. E. Rasmussen, <a
href="http://www.gaussianprocess.org/#williams-rasmussen-96">Gaussian Processes
for Regression</a>, Advances in Neural Information Processing Systems 8, pp
514-520, MIT Press (1996).
</li>
</ul>

Go back to the <a href="http://www.gaussianprocess.org/gpml">web page</a> for
Gaussian Processes for Machine Learning.

<hr>
<!-- Created: Mon Oct 17 14:16:13 CEST 2005 -->
<!-- hhmts start -->
Last modified: Mon Mar 27 16:10:14 CEST 2006
<!-- hhmts end -->
  </body>
</html>
